Abstract

A possibly immortal agent tries to maximise its summed discounted rewards over time,
where discounting is used to avoid infinite utilities and encourage the agent to value
current rewards more than future ones. Some commonly used discount functions lead to
time-inconsistent behavior where the agent changes its plan over time. These
inconsistencies can lead to very poor behavior. We generalise the usual discounted
utility model to one where the discount function changes with the age of the agent. We
then give a simple characterisation of time-(in)consistent discount functions and show
the existence of a rational policy for an agent that knows its discount function is
time-inconsistent.
1 Introduction

The goal of an agent is to maximise its expected utility; but how do we measure 
utility? One method is to assign an instantaneous reward to particular events, such 
as having a good meal, or a pleasant walk. It would be natural to measure the 
utility of a plan (policy) by simply summing the expected instantaneous rewards, 
but for immortal agents this may lead to infinite utility and also assumes rewards 
are equally valuable irrespective of the time at which they are received. 

One solution, the discounted utility (DU) model introduced by Samuelson in 
[Sam37], is to take a weighted sum of the rewards with earlier rewards usually 
valued more than later ones. 

There have been a number of criticisms of the DU model, which we will not 
discuss. For an excellent summary, see [FOO02]. Despite the criticisms, the DU 
model is widely used in both economics and computer science. 

A discount function is time-inconsistent if plans chosen to maximise expected 
discounted utility change over time. For example, many people express a preference 
for $110 in 31 days over $100 in 30 days, but reverse that preference 30 days later 
when given a choice between $110 tomorrow or $100 today [GFM94]. This behavior 
can be caused by a rational agent with a time-inconsistent discount function. 
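
The preference reversal above can be reproduced numerically. The following is a
minimal sketch, assuming one time-step per day, a hyperbolic weight 1/(1 + kappa * delay)
with kappa = 1, and a geometric weight gamma^delay with gamma = 0.95; the function names
and parameter values are illustrative and not taken from [GFM94].

```python
def hyperbolic_weight(delay, kappa=1.0):
    """Hyperbolic discount weight for a reward `delay` days away (kappa is assumed)."""
    return 1.0 / (1.0 + kappa * delay)


def geometric_weight(delay, gamma=0.95):
    """Geometric discount weight for a reward `delay` days away (gamma is assumed)."""
    return gamma ** delay


def prefers_later(now, weight):
    """True if, on day `now`, $110 on day 31 is valued above $100 on day 30."""
    early = 100 * weight(30 - now)
    late = 110 * weight(31 - now)
    return late > early


# Hyperbolic discounting: the preference flips between day 0 and day 30.
hyperbolic_day0 = prefers_later(0, hyperbolic_weight)
hyperbolic_day30 = prefers_later(30, hyperbolic_weight)

# Geometric discounting: the preference is the same on both days.
geometric_day0 = prefers_later(0, geometric_weight)
geometric_day30 = prefers_later(30, geometric_weight)
```

With these assumed parameters the hyperbolic agent prefers the later reward when both
are a month away but takes the $100 once it is immediately available, while the
geometric agent makes the same choice on both days.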

Unfortunately, time-inconsistent discount functions can lead to extremely bad 
behavior and so it becomes important to ask what discount functions are time- 
inconsistent. 

Previous work has focussed on a continuous model where agents can take actions 
at any time in a continuous time-space. We consider a discrete model where agents 
act in finite time-steps. In general this is not a limitation since any continuous 
environment can be approximated arbitrarily well by a discrete one. The discrete 
setting has the advantage of easier analysis, which allows us to consider a very 
general setup where environments are arbitrary finite or infinite Markov decision 
processes. 

Traditionally, the DU model has assumed a sliding discount function. Formally,
a sequence of instantaneous utilities (rewards) $R = (r_k, r_{k+1}, r_{k+2}, \cdots)$
starting at time $k$, is given utility equal to $\sum_{t=k}^{\infty} d_{t-k} r_t$ where
$d \in [0,1]^\infty$. We generalise this model as in [Hut06] by allowing the discount
function to depend on the age of the agent. The new utility is given by
$\sum_{t=k}^{\infty} d^k_t r_t$. This generalisation is consistent with how some agents
tend to behave; for example, humans becoming temporally less myopic as they grow older.

Strotz [Str55] showed that the only time-consistent sliding discount function 
is geometric discounting. We extend this result to a full characterisation of time- 
consistent discount functions where the discount function is permitted to change over 
time. We also show that discounting functions that are "nearly" time-consistent give 
rise to low regret in the anticipated future changes of the policy over time. 

Another important question is what policy should be adopted by an agent that
knows it is time-inconsistent. For example, if it knows it will become temporarily
myopic in the near future then it may benefit from paying a price to pre-commit
to following a particular policy. A number of authors have examined this question
in special continuous cases, including [Gol80, PY73, Pol68, Str55]. We modify their
results to our general, but discrete, setting using game theory.

The paper is structured as follows. First the required notation is introduced 
(Section 2). Example discount functions and the consequences of time-inconsistent 
discount functions are then presented (Section 3). We next state and prove the 
main theorems, the complete classification of discount functions and the continuity 
result (Section 4). The game theoretic view of what an agent should do if it knows 
its discount function is changing is analyzed (Section 5). Finally we offer some 
discussion and concluding remarks (Section 6). 

2 Notation and Problem Setup 

The general reinforcement learning (RL) setup involves an agent interacting sequentially
with an environment where in each time-step $t$ the agent chooses some action
$a_t \in \mathcal{A}$, whereupon it receives a reward $r_t \in \mathcal{R} \subseteq \mathbb{R}$
and observation $o_t \in \mathcal{O}$. The environment can be formally defined as a
probability distribution $\mu$ where
$\mu(r_t o_t \mid a_1 r_1 o_1 a_2 r_2 o_2 \cdots a_{t-1} r_{t-1} o_{t-1} a_t)$ is the
probability of receiving reward $r_t$ and observation $o_t$ having taken action $a_t$
after history $h_{<t} := a_1 r_1 o_1 \cdots a_{t-1} r_{t-1} o_{t-1}$. For convenience,
we assume that for a given history $h_{<t}$ and action $a_t$, $r_t$ is fixed (not
stochastic). We denote the set of all finite histories
$\mathcal{H} := (\mathcal{A} \times \mathcal{R} \times \mathcal{O})^*$ and write $h_{1:t}$
for a history of length $t$ and $h_{<t}$ for a history of length $t - 1$. $a_k$, $r_k$,
and $o_k$ are the $k$th action/reward/observation tuple of history $h$ and will be
used without explicitly redefining them (there will always be only one history "in
context").

A deterministic environment (where every value of $\mu(\cdot)$ is either 1 or 0) can be
represented as a graph with edges for actions, rewards of each action attached to the
corresponding edge, and observations in the nodes. For example, the deterministic
environment in the figure represents an environment where either pizza or pasta must
be chosen at each time-step (evening). An action leading to an upper node is eat
pizza while the ones leading to a lower node are eat pasta. The rewards are for a
consumer who prefers pizza to pasta, but dislikes having the same food twice in a
row. The starting node is marked as S. This example, along with all those for the
remainder of this paper, does not require observations.

[Figure: graph of the deterministic pizza/pasta environment with starting node S.]

The following assumption is required for clean results, but may be relaxed if an
$\epsilon$ of slop is permitted in some results.

Assumption 1. We assume that $\mathcal{A}$ and $\mathcal{O}$ are finite and that
$\mathcal{R} = [0, 1]$.

Definition 2 (Policy). A policy is a mapping $\pi : \mathcal{H} \to \mathcal{A}$ giving
an action for each history.




Given policy $\pi$ and history $h_{1:t}$ and $s \le t$ then the probability of reaching
history $h_{1:t}$ when starting from history $h_{<s}$ is $P(h_{s:t} \mid h_{<s}, \pi)$,
which is defined by

$$P(h_{s:t} \mid h_{<s}, \pi) := \prod_{k=s}^{t} \mu(r_k o_k \mid h_{<k}\, \pi(h_{<k})). \qquad (1)$$

If $s = 1$ then we abbreviate and write $P(h_{1:t} \mid \pi) := P(h_{1:t} \mid h_{<1}, \pi)$.

Definition 3 (Expected Rewards). When applying policy $\pi$ starting from history
$h_{<t}$, the expected sequence of rewards $R^\pi(h_{<t}) \in [0, 1]^\infty$ is defined by

$$R^\pi(h_{<t})_k := \sum_{h_{t:k}} P(h_{t:k} \mid h_{<t}, \pi)\, r_k.$$

If $k < t$ then $R^\pi(h_{<t})_k := 0$.

Note that while the set of all possible
$h_{t:k} \in (\mathcal{A} \times \mathcal{R} \times \mathcal{O})^{k-t+1}$ is uncountable
due to the reward term, we sum only over the possible rewards which are determined by
the action and previous history, and so this is actually a finite sum.

Definition 4 (Discount Vector). A discount vector $\boldsymbol{d}^k \in [0,1]^\infty$ is a
vector $[d^k_1, d^k_2, d^k_3, \cdots]$ satisfying $d^k_t > 0$ for at least one $t \ge k$.

The apparently superfluous superscript $k$ will be useful later when we allow the
discount vector to change with time. We do not insist that the discount vector be
summable, $\sum_{t=k}^{\infty} d^k_t < \infty$.

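Definition 5 (Expected Values). The expected discounted reward (or utility or
value) when using policy $\pi$ starting in history $h_{<t}$ and discount vector
$\boldsymbol{d}^k$ is

$$V^\pi_{\boldsymbol{d}^k}(h_{<t}) := R^\pi(h_{<t}) \cdot \boldsymbol{d}^k
  = \sum_{i=1}^{\infty} R^\pi(h_{<t})_i\, d^k_i
  = \sum_{i=t}^{\infty} R^\pi(h_{<t})_i\, d^k_i.$$

The sum can be taken to start from $t$ since $R^\pi(h_{<t})_i = 0$ for $i < t$. This
means that the value of $d^k_t$ for $t < k$ is unimportant, and never will be for any
result in this paper. As the scalar product is linear, a scaling of a discount vector
has no effect on the ordering of the policies. Formally, if
$V^{\pi_1}_{\boldsymbol{d}^k}(h_{<t}) \ge V^{\pi_2}_{\boldsymbol{d}^k}(h_{<t})$ then
$V^{\pi_1}_{\alpha \boldsymbol{d}^k}(h_{<t}) \ge V^{\pi_2}_{\alpha \boldsymbol{d}^k}(h_{<t})$
for all $\alpha > 0$.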

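Definition 6 (Optimal Policy/Value). In general, our agent will try to choose a
policy $\pi^*_{\boldsymbol{d}^k}$ to maximise $V^\pi_{\boldsymbol{d}^k}(h_{<t})$. This is
defined as follows.

$$\pi^*_{\boldsymbol{d}^k}(h_{<t}) := \arg\max_{\pi} V^\pi_{\boldsymbol{d}^k}(h_{<t}), \qquad
  R^*_{\boldsymbol{d}^k}(h_{<t}) := R^{\pi^*_{\boldsymbol{d}^k}}(h_{<t}), \qquad
  V^*_{\boldsymbol{d}^k}(h_{<t}) := V^{\pi^*_{\boldsymbol{d}^k}}_{\boldsymbol{d}^k}(h_{<t}).$$

If multiple policies are optimal then $\pi^*_{\boldsymbol{d}^k}$ is chosen using some
arbitrary rule. Unfortunately, $\pi^*_{\boldsymbol{d}^k}$ need not exist without one
further assumption.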

Assumption 7. For all $\pi$ and $k \ge 1$,
$\lim_{t \to \infty} \sum_{h_{<t}} P(h_{<t} \mid \pi)\, V^\pi_{\boldsymbol{d}^k}(h_{<t}) = 0$.

Assumption 7 appears somewhat arbitrary. We consider:

1. For summable d the assumption is true for all environments. With the ex- 
ception of hyperbolic discounting, all frequently used discount vectors are 
summable. 

2. For non-summable discount vectors $\boldsymbol{d}^k$ the assumption implies a restriction
on the possible environments. In particular, they must return asymptotically 
lower rewards in expectation. This restriction is necessary to guarantee the 
existence of the value function. 

From now on, including in theorem statements, we only consider environments/discount
vectors satisfying Assumptions 1 and 7. The following theorem then guarantees the
existence of $\pi^*_{\boldsymbol{d}^k}$.

Theorem 8 (Existence of Optimal Policy). $\pi^*_{\boldsymbol{d}^k}$ exists for any
environment and discount vector $\boldsymbol{d}^k$ satisfying Assumptions 1 and 7.

The proof of the existence theorem is in the appendix. 

An agent can use a different discount vector $\boldsymbol{d}^k$ for each time $k$. This
motivates the following definition.

Definition 9 (Discount Matrix). A discount matrix $\mathbf{d}$ is an $\infty \times \infty$
matrix with discount vector $\boldsymbol{d}^k$ for the $k$th column.

It is important that we distinguish between a discount matrix $\mathbf{d}$ (written bold),
a discount vector $\boldsymbol{d}^k$ (bold and italics), and a particular value in a
discount vector $d^k_t$ (just italics).

Definition 10 (Sliding Discount Matrix). A discount matrix $\mathbf{d}$ is sliding if
$d^{k+1}_{t+1} = d^k_t$ for all $k, t \ge 1$.

Definition 11 (Mixed Policy). The mixed policy is the policy where at each time
step $t$, the agent acts according to the possibly different policy
$\pi^*_{\boldsymbol{d}^t}$.

$$\pi_{\mathbf{d}}(h_{<t}) := \pi^*_{\boldsymbol{d}^t}(h_{<t}), \qquad
  R_{\mathbf{d}}(h_{<t}) := R^{\pi_{\mathbf{d}}}(h_{<t}).$$

We do not denote the mixed policy by $\pi^*_{\mathbf{d}}$ as it is arguably not optimal,
as discussed in Section 5. While non-unique optimal policies $\pi^*_{\boldsymbol{d}^k}$
at least result in equal discounted utilities, this is not the case for
$\pi_{\mathbf{d}}$. All theorems are proved with respect to any choice of
$\pi_{\mathbf{d}}$.

Definition 12 (Time Consistency). A discount matrix $\mathbf{d}$ is time consistent if and
only if for all environments
$\pi^*_{\boldsymbol{d}^k}(h_{<t}) = \pi^*_{\boldsymbol{d}^j}(h_{<t})$ for all $h_{<t}$
where $t \ge k, j$.

This means that a time-consistent agent taking action $\pi^*_{\boldsymbol{d}^t}(h_{<t})$
at each time $t$ will not change its plans. On the other hand, a time-inconsistent agent
may at time 1 intend to take action $a$ should it reach history $h_{<t}$
($\pi^*_{\boldsymbol{d}^1}(h_{<t}) = a$). However upon reaching $h_{<t}$, it need not be
true that $\pi^*_{\boldsymbol{d}^t}(h_{<t}) = a$.

3 Examples 

In this section we review a number of common discount matrices and give an example 
where a time-inconsistent discount matrix causes very bad behavior. 

Constant Horizon. Constant horizon discounting is where the agent only cares
about the future up to $H$ time-steps away, defined by $d^k_t = [\![ t - k < H ]\!]$.¹
Shortly we will see that the constant horizon discount matrix can lead to very bad
behavior in some environments.

Fixed Lifetime. Fixed lifetime discounting is where an agent knows it will not care
about any rewards past time-step $m$, defined by $d^k_t = [\![ t < m ]\!]$. Unlike the
constant horizon method, a fixed lifetime discount matrix is time-consistent.
Unfortunately it requires you to know the lifetime of the agent beforehand and also
makes asymptotic analysis impossible.

Hyperbolic. $d^k_t = 1/(1 + \kappa(t - k))$. The parameter $\kappa$ determines how
farsighted the agent is, with smaller values leading to more farsighted agents.
Hyperbolic discounting is often used in economics, with some experimental studies
explaining human time-inconsistent behavior by suggesting that we discount
hyperbolically [Tha81]. The hyperbolic discount matrix is not summable, so may be
replaced by the following (similar to [Hut04]), which has similar properties for
$\beta$ close to 1.

$$d^k_t = 1/(1 + \kappa(t - k))^\beta \quad \text{with } \beta > 1.$$

Geometric. $d^k_t = \gamma^t$ with $\gamma \in (0, 1)$. Geometric discounting is the
most commonly used discount matrix. Philosophically it can be justified by assuming an
agent will die (and not care about the future after death) with probability
$1 - \gamma$ at each time-step. Another justification for geometric discounting is its
analytic simplicity: it is summable and leads to time-consistent policies. It also
models fixed interest rates.

No Discounting. $d^k_t = 1$, for all $k, t$. [LH07] and [Leg08] point out that
discounting future rewards via an explicit discount matrix is unnecessary since the
environment can capture both temporal preferences for early (or late) consumption,
as well as the risk associated with delaying consumption. Of course, this "discount
matrix" is not summable, but can be made to work by insisting that all environments
satisfy Assumption 7. This approach is elegant in the sense that it eliminates
the need for a discount matrix, essentially admitting far more complex preferences
regarding inter-temporal rewards than a discount matrix allows. On the other hand,
a discount matrix gives the "controller" an explicit way to adjust the myopia of the
agent.

¹ $[\![ \mathit{expr} ]\!] = 1$ if $\mathit{expr}$ is true and 0 otherwise.
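
The discount matrices listed in this section can be written down directly as
functions of the pair (k, t). The sketch below is illustrative only: the function
names, default parameter values and the finite truncation size `n` are assumptions,
and entries with t < k are never evaluated because they do not affect any value.

```python
import math


def constant_horizon(k, t, H=2):
    """d^k_t = [[t - k < H]]."""
    return 1.0 if t - k < H else 0.0


def fixed_lifetime(k, t, m=3):
    """d^k_t = [[t < m]], the same vector for every k."""
    return 1.0 if t < m else 0.0


def hyperbolic(k, t, kappa=1.0, beta=1.0):
    """d^k_t = 1 / (1 + kappa (t - k))^beta; beta > 1 gives the summable variant."""
    return 1.0 / (1.0 + kappa * (t - k)) ** beta


def geometric(k, t, gamma=0.9):
    """d^k_t = gamma^t, independent of k."""
    return gamma ** t


def is_sliding(d, n):
    """Check d^{k+1}_{t+1} == d^k_t on the truncation 1 <= k <= t < n."""
    return all(
        math.isclose(d(k + 1, t + 1), d(k, t))
        for k in range(1, n)
        for t in range(k, n)
    )
```

On a truncation such as `n = 6`, the constant horizon and hyperbolic rules pass the
sliding check, while the fixed lifetime rule and the geometric rule written as
gamma^t do not (the latter is only sliding up to a rescaling of each column).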

To illustrate the potential consequences of time-inconsistent discount matrices we
consider the policies of several agents acting in the following environment. Let agent
A use a constant horizon discount matrix with $H = 2$ and agent B a geometric discount
matrix with some discount rate $\gamma$.

[Figure: deterministic environment with starting node S in which the agent can, at
each time-step, either move right or move up.]

In the first time-step agent A prefers to move right with the intention of moving
up in the second time-step for a reward of 2/3. However, once in the second time-step,
it will change its plan by moving right again. This continues indefinitely, so agent
A will always delay moving up and receives zero reward forever.

Agent B acts very differently. Let $\pi_t$ be the policy in which the agent moves right
until time-step $t$, then up and right indefinitely. Then
$V^{\pi_t}_{\boldsymbol{d}^k}(h_{<1}) = \gamma^t \frac{t}{t+1}$. This value does not
depend on $k$ and so the agent will move right until
$t = \arg\max_t \{ \gamma^t \frac{t}{t+1} \} < \infty$,
when it will move up and receive a reward.
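
The contrast between the two agents can be simulated. This is a minimal sketch that
assumes the reward for moving up at time-step t is t/(t+1) (consistent with the
rewards 1/2 and 2/3 quoted in the text) and that no reward is available after moving
up; the function names and the search limit are illustrative.

```python
def up_reward(t):
    """Assumed reward for moving up at time-step t."""
    return t / (t + 1)


def constant_horizon_moves_up(t, H=2):
    """Agent A at time t plans over the window t..t+H-1 and moves up only if
    moving up right now is the best time within that window."""
    window = range(t, t + H)
    best_time = max(window, key=up_reward)
    return best_time == t


def simulate_constant_horizon(steps, H=2):
    """Reward collected by agent A over `steps` time-steps."""
    for t in range(1, steps + 1):
        if constant_horizon_moves_up(t, H):
            return up_reward(t)
    return 0.0  # the agent kept postponing and never moved up


def geometric_stop_time(gamma, search_limit=1000):
    """Time at which agent B moves up: argmax_t gamma^t * t/(t+1)."""
    return max(range(1, search_limit + 1), key=lambda t: gamma ** t * up_reward(t))


reward_A = simulate_constant_horizon(steps=50)
stop_B = geometric_stop_time(gamma=0.9)
```

Under these assumptions agent A collects nothing however long the simulation runs,
whereas agent B with gamma = 0.9 moves up at a fixed, finite time.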

The actions of agent A are an example of the worst possible behavior arising
from time-inconsistent discounting. Nevertheless, agents with a constant horizon
discount matrix are used in all kinds of problems, in particular by agents in zero-sum
games, where fixed-depth minimax searches are common. In practice, serious
time-inconsistent behavior for game-playing agents seems rare, presumably because most
strategic games don't have a reward structure similar to the example above.

4 Theorems 

The main theorem of this paper is a complete characterisation of time-consistent
discount matrices.

Theorem 13 (Characterisation). Let $\mathbf{d}$ be a discount matrix, then the following
are equivalent.

1. $\mathbf{d}$ is time-consistent (Definition 12).

2. For each $k$ there exists an $\alpha_k \in \mathbb{R}$ such that
   $d^k_t = \alpha_k d^1_t$ for all $t \ge k \in \mathbb{N}$.

Recall that a discount matrix is sliding if $d^k_t = d^{k+1}_{t+1}$. Theorem 13 can be
used to show that if a sliding discount matrix is used as in [Str55] then the only
time-consistent discount matrix is geometric. Let $\mathbf{d}$ be a time-consistent
sliding discount matrix. By Theorem 13 and the definition of sliding,
$\alpha_2 d^1_{t+1} = d^2_{t+1} = d^1_t$. Therefore $\frac{1}{\alpha_2} d^1_1 = d^1_2$ and
$d^1_3 = \frac{1}{\alpha_2} d^1_2 = \left(\frac{1}{\alpha_2}\right)^2 d^1_1$ and similarly,
$d^1_t = \left(\frac{1}{\alpha_2}\right)^{t-1} d^1_1 \propto \gamma^t$ with
$\gamma = 1/\alpha_2$, which is geometric discounting. This is the analogue to the
results of [Str55] converted to our setting.

The theorem can also be used to construct time-consistent discount rates. Let
$\boldsymbol{d}^1$ be a discount vector, then the discount matrix defined by
$d^k_t := d^1_t$ for all $t \ge k$ will always be time-consistent, for example, the fixed
lifetime discount matrix with $d^k_t = 1$ if $t < H$ for some horizon $H$. Indeed, all
time-consistent discount rates can be constructed in this way (up to scaling).
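
Condition 2 of Theorem 13 suggests a simple numerical test. The sketch below checks,
on a finite truncation of assumed size `n`, whether each column is proportional to the
first column on the indices t >= k by testing that all cross-products agree. Passing
the check on a truncation is necessary but not sufficient for the infinite matrix;
the function names and example matrices are illustrative.

```python
import math


def proportional_tails(d, k, n):
    """True if (d^k_t) and (d^1_t) are proportional over t = k..n,
    i.e. d^k_s * d^1_t == d^k_t * d^1_s for every pair s, t."""
    ts = range(k, n + 1)
    return all(
        math.isclose(d(k, s) * d(1, t), d(k, t) * d(1, s), abs_tol=1e-12)
        for s in ts
        for t in ts
    )


def looks_time_consistent(d, n):
    """Apply the proportionality test to every column k = 1..n."""
    return all(proportional_tails(d, k, n) for k in range(1, n + 1))


# Example discount matrices, written as functions of (k, t) with assumed parameters.
geometric = lambda k, t: 0.9 ** t
sliding_geometric = lambda k, t: 0.9 ** (t - k)
fixed_lifetime = lambda k, t: 1.0 if t < 4 else 0.0
constant_horizon = lambda k, t: 1.0 if t - k < 2 else 0.0
hyperbolic = lambda k, t: 1.0 / (1.0 + (t - k))
```

For instance, `looks_time_consistent(geometric, 8)` and
`looks_time_consistent(fixed_lifetime, 8)` hold, while the constant horizon and
hyperbolic matrices fail the test already at k = 2.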

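Proof of Theorem 13. $2 \Rightarrow 1$: This direction follows easily from linearity of
the scalar product.

$$\pi^*_{\boldsymbol{d}^k}(h_{<t}) = \arg\max_{\pi} V^\pi_{\boldsymbol{d}^k}(h_{<t})
  = \arg\max_{\pi} R^\pi(h_{<t}) \cdot \boldsymbol{d}^k
  = \arg\max_{\pi} R^\pi(h_{<t}) \cdot \alpha_k \boldsymbol{d}^1 \qquad (2)$$
$$= \arg\max_{\pi} \alpha_k R^\pi(h_{<t}) \cdot \boldsymbol{d}^1
  = \arg\max_{\pi} R^\pi(h_{<t}) \cdot \boldsymbol{d}^1
  = \pi^*_{\boldsymbol{d}^1}(h_{<t})$$

as required. The last equality on line (2) follows from the assumption that
$d^k_t = \alpha_k d^1_t$ for all $t \ge k$ and because $R^\pi(h_{<t})_i = 0$ for all
$i < t$.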

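$1 \Rightarrow 2$: Let $\boldsymbol{d}^1$ and $\boldsymbol{d}^k$ be the discount vectors
used at the first time-step and at time $k$ respectively. Now let
$k \le t_1 < t_2 < \cdots$ and consider the deterministic environment below where the
agent has a choice between earning reward $r_1$ at time $t_1$ or $r_2$ at time $t_2$.
In this environment there are only two policies, $\pi_1$ and $\pi_2$, where
$R^{\pi_1}(h_{<k}) = r_1 e_{t_1}$ and $R^{\pi_2}(h_{<k}) = r_2 e_{t_2}$ with $e_i$ the
infinite vector with all components zero except the $i$th, which is 1.

[Figure: deterministic environment offering reward $r_1$ at time $t_1$ or reward $r_2$
at time $t_2$.]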
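Since $\mathbf{d}$ is time-consistent, for all $r_1, r_2 \in \mathcal{R}$ and
$k \in \mathbb{N}$ we have:

$$\arg\max_{\pi} V^\pi_{\boldsymbol{d}^1}(h_{<k}) = \arg\max_{\pi} R^\pi(h_{<k}) \cdot \boldsymbol{d}^1 \qquad (3)$$
$$= \arg\max_{\pi} R^\pi(h_{<k}) \cdot \boldsymbol{d}^k = \arg\max_{\pi} V^\pi_{\boldsymbol{d}^k}(h_{<k}). \qquad (4)$$

Now $V^{\pi_1}_{\boldsymbol{d}^k} \ge V^{\pi_2}_{\boldsymbol{d}^k}$ if and only if
$\boldsymbol{d}^k \cdot [R^{\pi_1}(h_{<k}) - R^{\pi_2}(h_{<k})] = [d^k_{t_1}, d^k_{t_2}] \cdot [r_1, -r_2] \ge 0$.
Therefore we have that,

$$[d^1_{t_1}, d^1_{t_2}] \cdot [r_1, -r_2] \ge 0 \iff [d^k_{t_1}, d^k_{t_2}] \cdot [r_1, -r_2] \ge 0. \qquad (5)$$

Letting $\cos\theta_k$ be the cosine of the angle between $[d^k_{t_1}, d^k_{t_2}]$ and
$[r_1, -r_2]$, Equation (5) becomes $\cos\theta_1 \ge 0 \iff \cos\theta_k \ge 0$.
Choosing $[r_1, -r_2] \propto [d^1_{t_2}, -d^1_{t_1}]$ implies that $\cos\theta_1 = 0$
and so $\cos\theta_k = 0$. Therefore there exists $\alpha_k \in \mathbb{R}$ such that

$$[d^k_{t_1}, d^k_{t_2}] = \alpha_k [d^1_{t_1}, d^1_{t_2}]. \qquad (6)$$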



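Let $k \le t_1 < t_2 < t_3 < \cdots$ be a sequence for which $d^1_{t_i} > 0$. By the
previous argument we have that
$[d^k_{t_i}, d^k_{t_{i+1}}] = \alpha_k [d^1_{t_i}, d^1_{t_{i+1}}]$ and
$[d^k_{t_{i+1}}, d^k_{t_{i+2}}] = \alpha'_k [d^1_{t_{i+1}}, d^1_{t_{i+2}}]$.
Therefore $\alpha_k = \alpha'_k$, and by induction, $d^k_{t_i} = \alpha_k d^1_{t_i}$ for
all $i$. Now if $t \ge k$ and $d^1_t = 0$ then $d^k_t = 0$ by equation (6). By symmetry,
$d^k_t = 0 \Rightarrow d^1_t = 0$. Therefore $d^k_t = \alpha_k d^1_t$ for all $t \ge k$
as required. $\Box$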

In Section 3 we saw an example where time-inconsistency led to very bad behavior.
The discount matrix causing this was very time-inconsistent. Is it possible
that an agent using a "nearly" time-consistent discount matrix can exhibit similar
bad behavior? For example, could rounding errors when using a geometric discount
matrix seriously affect the agent's behavior? The following theorem shows that this
is not possible. First we require a measure of the cost of time-inconsistent behavior.
The regret experienced by the agent at time zero from following policy
$\pi_{\mathbf{d}}$ rather than $\pi^*_{\boldsymbol{d}^1}$ is
$V^*_{\boldsymbol{d}^1}(h_{<1}) - V^{\pi_{\mathbf{d}}}_{\boldsymbol{d}^1}(h_{<1})$. We also
need a distance measure on the space of discount vectors.

Definition 14 (Distance Measure). Let $\boldsymbol{d}^k, \boldsymbol{d}^j$ be discount
vectors, then define a distance measure $D$ by

$$D(\boldsymbol{d}^k, \boldsymbol{d}^j) := \sum_{i = \max\{k, j\}}^{\infty} |d^k_i - d^j_i|.$$

Note that this is almost the taxicab metric, but the sum is restricted to
$i \ge \max\{k, j\}$.
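
For discount vectors that agree beyond some finite index, the distance measure can be
computed exactly on a truncation. The following is a small sketch, assuming the
discount matrix is supplied as a function `d(k, t)` and that the truncation index `n`
is chosen large enough; both the signature and the example matrices are illustrative.

```python
def distance(d, k, j, n):
    """Truncated D(d^k, d^j): sum of |d^k_i - d^j_i| for i = max(k, j)..n."""
    start = max(k, j)
    return sum(abs(d(k, i) - d(j, i)) for i in range(start, n + 1))


# Geometric discounting written as gamma^t uses the same vector at every time.
geometric = lambda k, t: 0.9 ** t

# Constant horizon with H = 2: the vectors used at times 1 and 2 differ at i = 3 only.
constant_horizon = lambda k, t: 1.0 if t - k < 2 else 0.0

d_geometric = distance(geometric, 1, 2, n=50)
d_horizon = distance(constant_horizon, 1, 2, n=50)
```

Here the geometric matrix gives distance 0 between consecutive vectors, while the
constant horizon matrix gives distance 1.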

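Theorem 15 (Continuity). Suppose $\epsilon \ge 0$ and
$D_{k,j} := D(\boldsymbol{d}^k, \boldsymbol{d}^j)$ then

[Displayed inequality lost in extraction: an upper bound on the regret
$V^*_{\boldsymbol{d}^1}(h_{<1}) - V^{\pi_{\mathbf{d}}}_{\boldsymbol{d}^1}(h_{<1})$
involving a sum over $k = 1, \dots, t - 1$.]

with $t = \min\{ t : \sum_{h_{<t}} P(h_{<t} \mid \pi^*_{\boldsymbol{d}^1})\, V^*_{\boldsymbol{d}^1}(h_{<t}) \le \epsilon \}$,
which for $\epsilon > 0$ is guaranteed to exist by Assumption 7.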

Theorem 15 implies that the regret of the agent at time zero in its future time- 
inconsistent actions is bounded by the sum of the differences between the discount 
vectors used at different times. If these differences are small then the regret is also 
small. For example, it implies that small perturbations (such as rounding errors) in 
a time-consistent discount matrix lead to minimal bad behavior. 

The proof is omitted due to limitations in space. It relies on proving the result 
for finite horizon environments and showing that this extends to the infinite case by 
using the horizon, t, after which the actions of the agent are no longer important. 
The bound in Theorem 15 is tight in the following sense. 



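Theorem 16. For $\delta > 0$ and $t \in \mathbb{N}$ and any sufficiently small
$\epsilon > 0$ there exists an environment and discount matrix such that

$$(t - 2)(1 - \epsilon)\delta < V^*_{\boldsymbol{d}^1}(h_{<1}) - V^{\pi_{\mathbf{d}}}_{\boldsymbol{d}^1}(h_{<1}) \le (t + 1)\delta$$

[Remainder of the display, which includes a sum with upper limit $t - 1$, lost in
extraction.]

where $t = \min\{ t : \sum_{h_{<t}} P(h_{<t} \mid \pi^*_{\boldsymbol{d}^1})\, V^*_{\boldsymbol{d}^1}(h_{<t}) = 0 \} < \infty$
and where $D(\boldsymbol{d}^k, \boldsymbol{d}^j) = D_{k,j} = \delta$ for all $k, j$.

Note that $t$ in the statement above is the same as that in the statement of
Theorem 15. Theorem 16 shows that there exists a discount matrix, environment
and $\epsilon > 0$ where the regret due to time-inconsistency is nearly equal to the
bound given by Theorem 15.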

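Proof of Theorem 16. Define $\mathbf{d}$ by

$$d^k_i := \begin{cases} \delta & \text{if } k < i \le t \\ 0 & \text{otherwise} \end{cases}$$

Observe that $D(\boldsymbol{d}^k, \boldsymbol{d}^j) = \delta$ for all $k < j < t$ since
$d^k_i = d^j_i$ for all $i$ except $i = j$. Now consider the environment below.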

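[Figure: environment used in the proof of Theorem 16, in which the agent can move
right or move down at each time-step.]
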

For sufficiently small $\epsilon$, the agent at time zero will plan to move right and
then down, leading to $R^*_{\boldsymbol{d}^1}(h_{<1}) = [0, 1 - \epsilon, 1 - \epsilon, \cdots]$
and $V^*_{\boldsymbol{d}^1}(h_{<1}) = (t - 1)\delta(1 - \epsilon)$.

To compute $R_{\mathbf{d}}$ note that $d^k_k = 0$ for all $k$. Therefore the agent in
time-step $k$ doesn't care about the next instantaneous reward, so prefers to move right
with the intention of moving down in the next time-step when the rewards are slightly
better. This leads to $R_{\mathbf{d}}(h_{<1}) = [0, 0, \cdots, 1 - \epsilon^{t-1}, 0, 0, \cdots]$.
Therefore,

$$V^*_{\boldsymbol{d}^1}(h_{<1}) - V^{\pi_{\mathbf{d}}}_{\boldsymbol{d}^1}(h_{<1})
  = (t - 1)\delta(1 - \epsilon) - (1 - \epsilon^{t-1})\delta > (t - 2)\delta(1 - \epsilon)$$

as required. $\Box$

5 Game Theoretic Approach 

What should an agent do if it knows it is time-inconsistent? One option is to treat
its future selves as "opponents" in an extensive game. The game has one player per
time-step who chooses the action for that time-step only. At the end of the game
the agent will have received a reward sequence $r \in \mathcal{R}^\infty$. The utility
given to the $k$th player is then $r \cdot \boldsymbol{d}^k$. So each player in this game
wishes to maximise the discounted reward with respect to a different discounting vector.

For example, let $\boldsymbol{d}^1 = [2, 1, 2, 0, 0, \cdots]$ and
$\boldsymbol{d}^2 = [*, 3, 1, 0, 0, \cdots]$ and consider the environment in the figure.
Initially, the agent has two choices. It can either move down to guarantee a reward
sequence of $r = [4, 0, 0, \cdots]$ which has utility of
$\boldsymbol{d}^1 \cdot [4, 0, 0, \cdots] = 8$ or it can move right in which case it
will receive a reward sequence of either $r' = [1, 3, 0, 0, \cdots]$ with
utility 5 or $r'' = [1, 1, 3, 0, 0, \cdots]$ with utility 9. Which of these
two reward sequences it receives is determined by the action taken in the second
time-step. However this action is chosen to maximise utility with respect to discount
sequence $\boldsymbol{d}^2$ and $\boldsymbol{d}^2 \cdot r' > \boldsymbol{d}^2 \cdot r''$.
This means that if at time 1 the agent chooses to move right, the final reward sequence
will be $[1, 3, 0, 0, \cdots]$ and the final utility with respect to $\boldsymbol{d}^1$
will be 5. Therefore the rational thing to do in time-step 1 is to move
down immediately for a utility of 8.

[Figure: two-step environment in which the agent first chooses between moving down and
moving right, and after moving right chooses again in the second time-step.]
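
The reasoning in this example can be checked with a few lines of code. This is a
sketch under stated assumptions: the vectors are truncated to four entries, the
unspecified first entry of the second discount vector (written * in the text) is set
to 0 because it never affects the second player's choice, and the variable names are
illustrative.

```python
d1 = [2, 1, 2, 0]
d2 = [0, 3, 1, 0]  # first entry is "*" in the text; 0 is a placeholder

down = [4, 0, 0, 0]         # move down immediately
right_down = [1, 3, 0, 0]   # r'
right_right = [1, 1, 3, 0]  # r''


def utility(d, rewards):
    """Scalar product of a discount vector with a reward sequence."""
    return sum(w * r for w, r in zip(d, rewards))


# Second player: chooses between r' and r'' using d2.
second_choice = max([right_down, right_right], key=lambda r: utility(d2, r))

# First player, backwards induction: compare moving down with what will
# actually happen after moving right.
equilibrium = max([down, second_choice], key=lambda r: utility(d1, r))

# First player, naive plan: assumes its own preferences are used at step two.
naive_plan = max([down, right_down, right_right], key=lambda r: utility(d1, r))
```

The naive plan targets the sequence with utility 9 but, because the second player
deviates, it actually realises utility 5; backwards induction instead selects the
sequence with utility 8.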

The technique above is known as backwards induction which is used to find 
sub-game perfect equilibria in finite extensive games. A variant of Kuhn's theorem 
proves that backwards induction can be used to find such equilibria in finite extensive 
games [OR94]. For arbitrary extensive games (possibly infinite) a sub-game perfect 
equilibrium need not exist, but we prove a theorem for our particular class of infinite 
games. 

A sub-game perfect equilibrium policy is one the players could agree to play, and 
subsequently have no incentive to renege on their agreement during play. It isn't 
always philosophically clear that a sub-game perfect equilibrium policy should be 
played. For a deeper discussion, including a number of good examples, see [OR94]. 

Definition 17 (Sub-game Perfect Equilibria). A policy $\pi^*_{\mathbf{d}}$ is a sub-game
perfect equilibrium policy if and only if for each $t$,
$V^{\pi^*_{\mathbf{d}}}_{\boldsymbol{d}^t}(h_{<t}) \ge V^{\tilde\pi}_{\boldsymbol{d}^t}(h_{<t})$
for all $h_{<t}$, where $\tilde\pi$ is any policy satisfying
$\tilde\pi(h_{<i}) = \pi^*_{\mathbf{d}}(h_{<i})$ for all $h_{<i}$ where $i \ne t$.

Theorem 18 (Existence of Sub-game Perfect Equilibrium Policy). For all environments
and discount matrices $\mathbf{d}$ satisfying Assumptions 1 and 7 there exists at least
one sub-game perfect equilibrium policy $\pi^*_{\mathbf{d}}$.

Many results in the literature of game theory almost prove this theorem. Our
setting is more difficult than most because we have countably many players (one
for each time-step) and exogenous uncertainty. Fortunately, it is made easier by the
very particular conditions on the preferences of players for rewards that occur late
in the game (Assumption 7). The closest related work appears to be that of Drew
Fudenberg in [Fud83], but our proof (see appendix) is very different. The proof idea
is to consider a sequence of environments identical to the original environment but
with an increasing bounded horizon after which reward is zero. By Kuhn's Theorem
[OR94] a sub-game perfect equilibrium policy must exist in each of these finite games.
However the space of policies is compact (Lemma 23) and so this sequence of sub-game
perfect equilibrium policies contains a convergent sub-sequence converging to a
policy $\pi$. It is not then hard to show that $\pi$ is a sub-game perfect equilibrium
policy in the original environment.

Proof of Theorem 18. Add an action $a^{death}$ to $\mathcal{A}$ and $\mu$ such that if
$a^{death}$ is taken at any time in $h_{<t}$ then $\mu$ returns zero reward. Essentially,
once the agent takes action $a^{death}$, the agent receives zero reward forever. Now if
$\pi^*_{\mathbf{d}}$ is a sub-game perfect equilibrium policy in this modified environment
then it is a sub-game perfect equilibrium policy in the original one.

For each $t \in \mathbb{N}$ choose $\pi_t$ to be a sub-game perfect equilibrium policy in
the further modified environment obtained by setting $r_i = 0$ if $i > t$. That is, the
environment which gives zero reward always after time $t$. We can assume without
loss of generality that $\pi_t(h_{<k}) = a^{death}$ for all $k > t$. Since $\Pi$ is
compact, the sequence $\pi_1, \pi_2, \cdots$ has a convergent subsequence
$\pi_{t_1}, \pi_{t_2}, \cdots$ converging to $\pi$ and satisfying

1. $\pi_{t_i}(h_{<k}) = \pi(h_{<k})$, for all $h_{<k}$ where $k \le i$.

2. $\pi_{t_i}$ is a sub-game perfect equilibrium policy in the modified environment with
   reward $r_k = 0$ if $k > t_i$.

3. $\pi_{t_i}(h_{<k}) = a^{death}$ if $k > t_i$.

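We write $\hat V^{\pi}_{\boldsymbol{d}^t}$ for the value function in the modified
environment with horizon $t_i$. It is now shown that $\pi$ is a sub-game perfect
equilibrium policy in the original environment. Fix a $t \in \mathbb{N}$ and let
$\tilde\pi$ be a policy with $\tilde\pi(h_{<k}) = \pi(h_{<k})$ for all $h_{<k}$ where
$k \ne t$. Now define policies $\tilde\pi_{t_i}$ by

$$\tilde\pi_{t_i}(h_{<k}) := \begin{cases} \tilde\pi(h_{<k}) & \text{if } k \le i \\ \pi_{t_i}(h_{<k}) & \text{otherwise} \end{cases}$$

By point 1 above, $\tilde\pi_{t_i}(h_{<k}) = \pi_{t_i}(h_{<k})$ for all $h_{<k}$ where
$k \ne t$. Now for all $i > t$ we have

$$V^{\pi}_{\boldsymbol{d}^t}(h_{<t}) \ge V^{\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t}) - |V^{\pi}_{\boldsymbol{d}^t}(h_{<t}) - V^{\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})| \qquad (7)$$
$$\ge \hat V^{\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t}) - |V^{\pi}_{\boldsymbol{d}^t}(h_{<t}) - V^{\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})| \qquad (8)$$
$$\ge \hat V^{\tilde\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t}) - |V^{\pi}_{\boldsymbol{d}^t}(h_{<t}) - V^{\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})| \qquad (9)$$
$$\ge V^{\tilde\pi}_{\boldsymbol{d}^t}(h_{<t}) - |V^{\pi}_{\boldsymbol{d}^t}(h_{<t}) - V^{\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})|
  - |V^{\tilde\pi}_{\boldsymbol{d}^t}(h_{<t}) - V^{\tilde\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})|
  - |V^{\tilde\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t}) - \hat V^{\tilde\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})| \qquad (10)$$

where (7) follows from arithmetic, (8) since $V \ge \hat V$, (9) since $\pi_{t_i}$ is a
sub-game perfect equilibrium policy in the modified environment, and (10) by arithmetic.
We now show that the absolute value terms in (10) converge to zero. Since
$V^{\pi}(\cdot)$ is continuous in $\pi$ and $\lim_{i \to \infty} \pi_{t_i} = \pi$ and
$\lim_{i \to \infty} \tilde\pi_{t_i} = \tilde\pi$, we obtain

$$\lim_{i \to \infty} \left[ |V^{\pi}_{\boldsymbol{d}^t}(h_{<t}) - V^{\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})|
  + |V^{\tilde\pi}_{\boldsymbol{d}^t}(h_{<t}) - V^{\tilde\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})| \right] = 0.$$

Now $\pi_{t_i}(h_{<k}) = a^{death}$ if $k > t_i$, so
$|V^{\tilde\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t}) - \hat V^{\tilde\pi_{t_i}}_{\boldsymbol{d}^t}(h_{<t})| = 0$.
Therefore taking the limit as $i$ goes to infinity in (10) shows that
$V^{\pi}_{\boldsymbol{d}^t}(h_{<t}) \ge V^{\tilde\pi}_{\boldsymbol{d}^t}(h_{<t})$ as
required. $\Box$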

In general, $\pi^*_{\mathbf{d}}$ need not be unique, and different sub-game equilibrium
policies can lead to different utilities. This is a normal, but unfortunate, problem
with the sub-game equilibrium solution concept. The policy is unique if for all players
the value of any two arbitrary policies is different. Also, if
$\forall k\, (V^{\pi_1}_{\boldsymbol{d}^k} = V^{\pi_2}_{\boldsymbol{d}^k} \Rightarrow \forall j\; V^{\pi_1}_{\boldsymbol{d}^j} = V^{\pi_2}_{\boldsymbol{d}^j})$
is true then the non-unique sub-game equilibrium policies have the same values
for all agents. Unfortunately, neither of these conditions is necessarily satisfied in
our setup. The problem of how players might choose a sub-game perfect equilibrium
policy appears surprisingly understudied. We feel it provides another reason to avoid
the situation altogether by using time-consistent discount matrices. The following
example illustrates the problem of non-unique sub-game equilibrium policies.

Example 19. Consider the example in Section 3 with an agent using a constant
horizon discount matrix with $H = 2$. There are exactly two sub-game perfect
equilibrium policies, $\pi_1$ and $\pi_2$, defined by

$$\pi_1(h_{<t}) = \begin{cases} \text{up} & \text{if } t \text{ is odd} \\ \text{right} & \text{otherwise} \end{cases}
  \qquad
  \pi_2(h_{<t}) = \begin{cases} \text{up} & \text{if } t \text{ is even} \\ \text{right} & \text{otherwise} \end{cases}$$

Note that the reward sequences (and values) generated by $\pi_1$ and $\pi_2$ are different
with $R^{\pi_1}(h_{<1}) = [1/2, 0, 0, \cdots]$ and $R^{\pi_2}(h_{<1}) = [0, 2/3, 0, 0, \cdots]$.
If the players choose to play a sub-game perfect equilibrium policy then the first
player can choose between $\pi_1$ and $\pi_2$ since they have the first move. In that case
it would be best to follow $\pi_2$ by moving right as it has a greater return for the
first player than $\pi_1$.
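
That neither player can gain by deviating from these two policies can be checked
mechanically over a finite number of time-steps. The sketch below assumes, as in the
Section 3 example, that moving up at time-step t yields t/(t+1) and that no reward is
available afterwards; it only examines histories in which the agent has not yet moved
up, since later choices do not affect any reward. All names are illustrative.

```python
import math


def up_reward(t):
    """Assumed reward for moving up at time-step t."""
    return t / (t + 1)


def pi1(t):
    return "up" if t % 2 == 1 else "right"


def pi2(t):
    return "up" if t % 2 == 0 else "right"


def value(policy, t, first_action, H=2):
    """Utility of player t (constant horizon H) when it plays `first_action`
    at time t and all later players follow `policy`."""
    action = first_action
    for s in range(t, t + H):
        if action == "up":
            return up_reward(s)
        action = policy(s + 1)
    return 0.0  # nobody moved up within player t's horizon


def no_profitable_deviation(policy, steps, H=2):
    """One-step deviation check for players 1..steps."""
    for t in range(1, steps + 1):
        played = policy(t)
        other = "right" if played == "up" else "up"
        if value(policy, t, other, H) > value(policy, t, played, H):
            return False
    return True


always_right = lambda t: "right"
first_player_pi1 = value(pi1, 1, pi1(1))
first_player_pi2 = value(pi2, 1, pi2(1))
```

Under these assumptions both policies pass the deviation check, the policy that
always moves right does not, and the first player's return is 1/2 under the first
policy and 2/3 under the second.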

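For time-consistent discount matrices we have the following proposition.

Proposition 20. If $\mathbf{d}$ is time-consistent then
$V^*_{\boldsymbol{d}^k} = V^{\pi_{\mathbf{d}}}_{\boldsymbol{d}^k} = V^{\pi^*_{\mathbf{d}}}_{\boldsymbol{d}^k}$
for all $k$ and choices of $\pi^*_{\boldsymbol{d}^k}$ and $\pi_{\mathbf{d}}$ and
$\pi^*_{\mathbf{d}}$.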

Is it possible that backwards induction is simply expected discounted reward 
maximisation in another form? The following theorem shows this is not the case 
and that sub-game perfect equilibrium policies are a rich and interesting class worthy 
of further study in this (and more general) settings. 

Theorem 21. $\exists\, \mathbf{d}$ such that
$\pi^*_{\mathbf{d}} \ne \pi^*_{\tilde{\boldsymbol{d}}}$ for all discount vectors
$\tilde{\boldsymbol{d}}$.

The result is proven using a simple counter-example. The idea is to construct
a stochastic environment where the first action leads the agent to one of two
sub-environments, each with probability half. These environments are identical to the
example at the start of this section, but one of them has the reward 1 (rather than
3) for the history right, down. It is then easily shown that $\pi^*_{\mathbf{d}}$ is not
the result of an expectimax expression because it behaves differently in each
sub-environment, while any expectimax search (irrespective of discounting) will behave
the same in each.



6 Discussion 

Summary. Theorem 13 gives a characterisation of time-(in)consistent discount
matrices and shows that all time-consistent discount matrices follow the simple
form of $d^k_t = d^1_t$ (up to scaling). Theorem 15 shows that using a discount matrix
that is nearly time-consistent produces mixed policies with low regret. This is useful
for a few reasons, including showing that small perturbations, such as rounding errors,
in a discount matrix cannot cause major time-inconsistency problems. It also shows
that "cutting off" time-consistent discount matrices after some fixed depth, which
makes the agent potentially time-inconsistent, doesn't affect the policies too much,
provided the depth is large enough. When a discount matrix is very time-inconsistent
then taking a game theoretic approach may dramatically decrease the regret in the
change of policy over time.

Some comments on the policies $\pi^*_{\boldsymbol{d}^k}$ (policy maximising expected
$\boldsymbol{d}^k$-discounted reward), $\pi_{\mathbf{d}}$ (mixed policy using
$\pi^*_{\boldsymbol{d}^t}$ at each time-step $t$) and $\pi^*_{\mathbf{d}}$ (sub-game
perfect equilibrium policy).

1. A time-consistent agent should play policy $\pi^*_{\boldsymbol{d}^k} = \pi_{\mathbf{d}}$
   for any $k$. In this case, every optimal policy $\pi^*_{\boldsymbol{d}^k}$ is also a
   sub-game perfect equilibrium policy.

2. $\pi_{\mathbf{d}}$ will be played by an agent that believes it is time-consistent, but
   may not be. This can lead to very bad behavior as shown in Section 3.

3. An agent may play $\pi^*_{\mathbf{d}}$ if it knows it is time-inconsistent, and also
   knows exactly how (i.e., it knows $\boldsymbol{d}^k$ for all $k$ at every time-step).
   This policy is arguably rational, but comes with its own problems, especially
   non-uniqueness as discussed.

Assumptions. We made a number of assumptions about which we make some brief 
comments. 

1. Assumption 1, which states that $\mathcal{A}$ and $\mathcal{O}$ are finite, guarantees the
existence of an optimal policy. Removing the assumption would force us to use
$\epsilon$-optimal policies; the theorems should still go through, in some cases with an
additive $\epsilon$ slop term.

2. Assumption 7 only affects non-summable discount vectors. Without it, even
$\epsilon$-optimal policies need not exist and all the machinery will break down.

3. The use of discrete time greatly reduced the complexity of the analysis. Given
a sufficiently general model, the set of continuous environments should contain
all discrete environments. For this reason the proof of Theorem 13 should go
through essentially unmodified. The same may not be true for Theorems 15
and 18. The former may be fixable with substantial effort (and perhaps should
be true intuitively). The latter has been partially addressed, with a positive
result in [Gol80, PY73, Pol68, Str55].



A Technical Proofs

Before the proof of Theorem 8 we require a definition and two lemmas. 

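Definition 22. Let $\Pi$ be the set of all policies, each mapping histories $h_{<t}$ to
actions in $\mathcal{A}$, and define a metric $D$ on $\Pi$ by
$T(\pi_1, \pi_2) := \min_{t \in \mathbb{N}} \{t : \exists h_{<t} \text{ s.t. } \pi_1(h_{<t}) \neq \pi_2(h_{<t})\}$,
or $\infty$ if $\pi_1 = \pi_2$, and $D(\pi_1, \pi_2) := \exp(-T(\pi_1, \pi_2))$.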

$T$ is the time-step at which $\pi_1$ and $\pi_2$ first differ. Now augment $\Pi$ with the
topology induced by the metric $D$.
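
The metric can be made concrete with a small sketch. The code below is only
an illustration under stated assumptions: policies are modelled as Python
functions from history tuples to actions, histories are tuples over a finite
alphabet `symbols` (a hypothetical stand-in for the percepts in a history),
and the search for the first difference is truncated at `max_depth`, so a
result of infinity only means that no difference was found within that depth.

```python
import itertools
import math

def first_difference(pi1, pi2, symbols, max_depth):
    """Truncated T(pi1, pi2): the first time-step t <= max_depth at which the
    two policies choose different actions on some history h_{<t}."""
    for t in range(1, max_depth + 1):
        # h_{<t} has t - 1 entries; enumerate every such history.
        for history in itertools.product(symbols, repeat=t - 1):
            if pi1(history) != pi2(history):
                return t
    # No difference found within the search depth.
    return math.inf

def policy_distance(pi1, pi2, symbols, max_depth):
    """D(pi1, pi2) = exp(-T(pi1, pi2)); exp(-inf) evaluates to 0.0."""
    return math.exp(-first_difference(pi1, pi2, symbols, max_depth))

# Two hypothetical policies that agree on histories with fewer than two entries.
always_left = lambda history: "left"
left_then_right = lambda history: "left" if len(history) < 2 else "right"

t_first = first_difference(always_left, left_then_right, ("a", "b"), 5)
distance = policy_distance(always_left, left_then_right, ("a", "b"), 5)
self_distance = policy_distance(always_left, always_left, ("a", "b"), 5)
```

Policies that agree for longer are closer together under $D$, which is the
property the compactness argument below relies on.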

Lemma 23. $\Pi$ is compact.

Proof. We proceed by showing $\Pi$ is totally bounded and complete. Let $\epsilon = \exp(-t)$
and define an equivalence relation by $\pi \sim \pi'$ if and only if $T(\pi, \pi') > t$. If
$\pi \sim \pi'$ then $D(\pi, \pi') < \epsilon$. Note that $\Pi/\sim$ is finite. Now choose a
representative from each class to create a finite set $\bar{\Pi}$. Now
$\bigcup_{\pi \in \bar{\Pi}} B_\epsilon(\pi) = \Pi$, where $B_\epsilon(\pi)$ is the ball of radius
$\epsilon$ about $\pi$. Therefore $\Pi$ is totally bounded.

Next, we show $\Pi$ is complete. Let $\pi_1, \pi_2, \cdots$ be a Cauchy sequence with
$D(\pi_i, \pi_{i+j}) < \exp(-i)$ for all $j > 0$. Therefore
$\pi_i(h_{<k}) = \pi_{i+j}(h_{<k})$ $\forall h_{<k}$ with $k < i$, by the definition of $D$.
Now define $\pi$ by $\pi(h_{<t}) := \pi_t(h_{<t})$ and note that
$\pi_i(h_{<j}) = \pi(h_{<j})$ $\forall j < i$ since $\pi_i(h_{<k}) = \pi_k(h_{<k}) = \pi(h_{<k})$
for $k < i$. Therefore $\lim_{i \to \infty} \pi_i = \pi$ and so $\Pi$ is complete. Finally,
$\Pi$ is compact by the Heine-Borel theorem. □

Lemma 24. When viewed as a function from $\Pi$ to $\mathbb{R}$, $V^{(\cdot)}_{d^k}$ is continuous
(given Assumption 7).

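Proof. Suppose $D(\pi_1, \pi_2) < \exp(-t)$; then $\pi_1$ and $\pi_2$ are identical on all
histories up to length $t$. Therefore

$$\left|V^{\pi_1}_{d^k}(h_{<k}) - V^{\pi_2}_{d^k}(h_{<k})\right| \le d^k \cdot \left[R^{\pi_1}(h_{<k}) + R^{\pi_2}(h_{<k})\right] = \sum_{i=k}^{\infty} d^k_i \left(R^{\pi_1}(h_{<k})_i + R^{\pi_2}(h_{<k})_i\right). \tag{11}$$

Since $\pi_1$ and $\pi_2$ are identical up to time $t$, (11) becomes

$$\sum_{i=t}^{\infty} d^k_i \left(R^{\pi_1}(h_{<k})_i + R^{\pi_2}(h_{<k})_i\right) = \sum_{h_{<t}} \left[P(h_{<t} \mid h_{<k}, \pi_1) V^{\pi_1}_{d^k}(h_{<t}) + P(h_{<t} \mid h_{<k}, \pi_2) V^{\pi_2}_{d^k}(h_{<t})\right], \tag{12}$$

where (12) follows from the definition of the reward and value functions. By
Assumption 7, $\lim_{t \to \infty} \sum_{h_{<t}} P(h_{<t} \mid h_{<k}, \pi_i) V^{\pi_i}_{d^k}(h_{<t}) = 0$
for $i \in \{1, 2\}$ and so $V$ is continuous. □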

Proof of Theorem 8. Let $\Pi$ be the space of all policies with the metric of
Definition 22. By Lemmas 23 and 24, $\Pi$ is compact and $V$ is continuous. Therefore
$\arg\max_{\pi} V^{\pi}_{d^k}(h_{<1})$ exists by the extreme value theorem. □



B Table of Notation

Symbol                                          Description

$d$                                             Discount matrix
$d^k$                                           Discount vector $k$
$d^k_t$                                         The $t$th component of discount vector $d^k$ (at time $k$, reward $r_t$ is discounted by $d^k_t$)
$k, t$                                          Indices; $k$ usually refers to a discount vector used at fixed time $k$, $t$ is usually a time index for states
$i$                                             Summing index
$\epsilon$                                      Small real numbers greater than zero
$\pi$                                           Policies
$\Pi$                                           The space of all policies
$\mathcal{A}, \mathcal{S}, \mathcal{R}, \mathcal{O}$    Action, state, reward and observation spaces
$R(s, a)$                                       The reward given when taking action $a$ in state $s$
$P(s' \mid s, a)$                               The probability of transitioning to state $s'$ from state $s$ having taken action $a$
$\mathbb{N}, \mathbb{R}$                        The natural and real numbers respectively
$S_t$                                           The set of all states reachable at time-step $t$
$S_{<t}$                                        The set of all states reachable up to time-step $t$
$B_\epsilon(\pi)$                               A ball of radius $\epsilon$
$R^{\pi}(h_{<t})$                               The expected reward sequence when following $\pi$ from state $h_{<t}$
$\pi^*_{d^k}$                                   The optimal policy when using discount vector $d^k$
$\pi_d$                                         The mixed policy using discount matrix $d$
                                                The sub-game equilibrium policy using discount matrix $d$
                                                The expected reward sequence when following the optimal policy $\pi^*_{d^k}$
$V^*_{d^k}(h_{<t})$                             The value of the optimal policy $\pi^*_{d^k}$
$\gamma$                                        Discount rate for geometric discounting
$\alpha_k$                                      A real valued scaling factor on a discount vector
$\kappa$                                        Discount rate for hyperbolic discounting
$H$                                             Horizon for constant depth discounting
$m$                                             Lifespan for fixed lifetime discounting
$s, s'$                                         States in a Markov decision process
$D(\pi_1, \pi_2)$                               The distance between policies $\pi_1$ and $\pi_2$ using the metric of Definition 22
                                                The distance measure between two discount vectors as defined by Definition 14